[PATCH] firmware: fw_base: Improve exception stack setup in trap handler

Anup Patel anup at brainfault.org
Tue Aug 11 09:40:13 EDT 2020


On Tue, Aug 11, 2020 at 6:05 PM Jessica Clarke <jrtc27 at jrtc27.com> wrote:
>
> On 11 Aug 2020, at 04:10, Anup Patel <anup at brainfault.org> wrote:
> >
> > On Mon, Aug 10, 2020 at 7:35 PM Jessica Clarke <jrtc27 at jrtc27.com> wrote:
> >>
> >> On 10 Aug 2020, at 14:52, Anup Patel <Anup.Patel at wdc.com> wrote:
> >>>
> >>> Currently, the low-level trap handler (i.e. _trap_handler()) uses
> >>> branch instructions to conditionally setup exception stack based
> >>> on which mode trap occured.
> >>>
> >>> This patch implements exception stack setup using xor instructions
> >>> which is 2 instructions less compared to previous approach and faster
> >>> due to lack of branch instructions.
> >>>
> >>> The new exception stack setup approach can be best described by the
> >>> following pseudocode:
> >>>
> >>> Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> >>
> >> This relies on PRIV 0b10 currently being unallocated, which doesn't
> >> make me feel warm and fuzzy inside.
> >
> > The PRIV 0b10 mode is not used by any of the upcoming extensions. In future,
> > we will require really strong justification in future to use the
> > unused 0b10 mode.
> > Even H-extension has cleanly avoided using PRIV 0b10 mode.
> >
> > In any case, we should document this assumption.
> >
> >>
> >>> Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> >>
> >> This relies on MUL with 0/1 being fast, but that may well be a different
> >
> > The MUL can be replaced with AND instruction but it was taking one
> > additional instruction. See below.
> >
> >    /*
> >     * Set T0 to appropriate exception stack
> >     *
> >     * Came_From_M_Mode = ((MSTATUS.MPP & 0x2) >> 1) - 1
> >     * Exception_Stack = SP ^ (Came_From_M_Mode & (TP ^ SP))
> >     *
> >     * Came_From_M_Mode = -1UL   ==>    Exception_Stack = TP
> >     * Came_From_M_Mode = 0UL    ==>    Exception_Stack = SP
> >     */
> >    csrr    t0, CSR_MSTATUS
> >    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
> >    and    t0, t0, 0x1
> >    add    t0, t0, -1
> >    xor    tp, tp, sp
> >    and    t0, t0, tp
> >    xor    tp, tp, sp
> >    xor    t0, sp, t0
> >
> >> pipeline in your processor (in an OoO core) with fewer available, or a
> >> higher-latency path to an M-box in a simpler in-order pipeline. The MUL
> >> is also not compressible. You probably do want a branch, but in a way
> >> that macro-op fusion will hopefully turn it into a conditional move:
> >>
> >>    mv Exception_Stack, tp
> >>    beqz Came_From_M_Mode, 1f
> >>    mv Exception_Stack, sp
> >> 1:
> >
> > Actually, I had started off with branches but code without branches (this
> > patch) is performing better on QEMU (because fewer TCG blocks required
> > for translating _trap_handler()). This code (this patch) will be much faster
> > on real HW because branch misses can cause bubbles in the instruction
> > pipeline.
>
> I don't think you properly read what I wrote. I specifically made it so
> there was a single branch to skip a single instruction and therefore a
> macro-op fusion candidate. That means the branch+move is turned into a
> conditional move, with no need to branch predict (and potentially
> suffer from a miss). I don't know whether QEMU recognises the pattern,
> but if not that's just a deficiency in QEMU itself, and fixing that
> would suddenly make my proposed solution faster than the one involving
> multiplication. You have also neglected to address the fact that my
> proposed code is more compressible than yours.

I don't think you understand the problem here. We have to choose a
an appropriate exception stack without modifying current SP because
current SP will have to be first saved on exception stack. That's why
we:
1. Compute and set exception stack in T0
2. Save current SP on exception stack pointed by T0
3. Change current SP to T0

This patch is only trying to optimize point1.

Also, the place where the exception stack is chosen, we can only
clobber T0 for computation to preserve state of other general purpose
registers.

Your proposal if using branch+move won't work because you need
two general purpose registers one for Exception_Stack and second
for Came_From_M_Mode.

>
> Optimising for QEMU is also not the right approach in my opinion; we
> should optimise for hardware first and foremost if we ever want RISC-V
> to actually succeed in the real world. Real hardware does macro-op
> fusion in many cases, and even in the cases it doesn't, multiplication
> is not cheap. In a classic 5-stage pipeline, multiplication is done a
> stage after execute, and may well be multi-cycle even for multiplying
> by 0/1, which would cause the pipeline to *always* stall, rather than
> *sometimes* in the case of a branch mispredict if macro-op fusion is
> not implemented. Given it's a direct branch to a nearby instruction,
> I'd expect both the misprediction and the corrected prediction to be L1
> I-cache hits, so an in-order core should see very little latency. A
> good out-of-order core like SonicBOOM will correctly decode it to a
> conditional move (in fact I believe it decodes short conditional
> branches as being a series of predicated instructions).

Nope, we are not optimizing just for QEMU. Replacing branches
with simple sequential 1-cyle ALU instructions (not MUL instruction)
is faster on simple in-order CPUs as well. The biggest problem with
branches is the wrong path taken by branch predictor which causes
pipeline flush.

>
> Also, if MUL with 0/1 were faster, do you not think compilers would
> already be emitting that for RISC-V? Because both GCC and LLVM will use
> (compressed) branch + (compressed) move to do a conditional move

I am not pushing for MUL instruction here. I am very well aware of the
cost of MUL instructions. Please look at my alternate proposal where
I use AND instruction in-place of MUL with slight modifications.

Also, we can't compare the low-level trap_handler with general
C context in which GCC/LLVM optimizes code where stack pointer
is already set.

The challenging part here is to select exception stack using only
one temporary (T0) general purpose register.

>
> > The code with branches is as follows:
> >    csrr    t0, CSR_MSTATUS
> >    srl    t0, t0, (MSTATUS_MPP_SHIFT + 1)
> >    and    t0, t0, 0x1
> >    beq    t0, zero, _trap_stack_s_mode
> > _trap_stack_m_mode:
> >    add    t0, sp, zero
> >    j    _trap_stack_done
> > _trap_stack_s_mode:
> >    add    t0, tp, zero
> > _trap_stack_done:
>
> Again, that's not what I wrote. Mine was shorter with only a single
> branch.

I think you missed the context in-which this optimization is being done.

Please see above.

Let me send V2 with some of your comments addressed.

Regards,
Anup

>
> Jess
>
> >>
> >> Same number of instructions, but all compressible, no multiplications
> >> required and the lone branch can be macro-op fused.
> >
> > Overall, the key thing is having no (or fewer) branches in _trap_handler()
> > with the same number of (or fewer) instructions.
> >
> > Regards,
> > Anup
> >
> >>
> >> Jess
> >>
> >>> Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> >>> Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> >>>
> >>> Signed-off-by: Anup Patel <anup.patel at wdc.com>
> >>> ---
> >>> firmware/fw_base.S | 49 ++++++++++++++++++++--------------------------
> >>> 1 file changed, 21 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/firmware/fw_base.S b/firmware/fw_base.S
> >>> index b66ac41..53a9a48 100644
> >>> --- a/firmware/fw_base.S
> >>> +++ b/firmware/fw_base.S
> >>> @@ -490,35 +490,28 @@ _trap_handler:
> >>>      /* Save T0 in scratch space */
> >>>      REG_S   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> >>>
> >>> -     /* Check which mode we came from */
> >>> +     /*
> >>> +      * Set T0 to appropriate exception stack
> >>> +      *
> >>> +      * Came_From_M_Mode = (MSTATUS.MPP & 0x2) >> 1
> >>> +      * Exception_Stack = TP ^ (Came_From_M_Mode * (SP ^ TP))
> >>> +      *
> >>> +      * Came_From_M_Mode = 0    ==>    Exception_Stack = TP
> >>> +      * Came_From_M_Mode = 1    ==>    Exception_Stack = SP
> >>> +      */
> >>>      csrr    t0, CSR_MSTATUS
> >>> -     srl     t0, t0, MSTATUS_MPP_SHIFT
> >>> -     and     t0, t0, PRV_M
> >>> -     xori    t0, t0, PRV_M
> >>> -     beq     t0, zero, _trap_handler_m_mode
> >>> -
> >>> -     /* We came from S-mode or U-mode */
> >>> -_trap_handler_s_mode:
> >>> -     /* Set T0 to original SP */
> >>> -     add     t0, sp, zero
> >>> -
> >>> -     /* Setup exception stack */
> >>> -     add     sp, tp, -(SBI_TRAP_REGS_SIZE)
> >>> -
> >>> -     /* Jump to code common for all modes */
> >>> -     j       _trap_handler_all_mode
> >>> -
> >>> -     /* We came from M-mode */
> >>> -_trap_handler_m_mode:
> >>> -     /* Set T0 to original SP */
> >>> -     add     t0, sp, zero
> >>> -
> >>> -     /* Re-use current SP as exception stack */
> >>> -     add     sp, sp, -(SBI_TRAP_REGS_SIZE)
> >>> -
> >>> -_trap_handler_all_mode:
> >>> -     /* Save original SP (from T0) on stack */
> >>> -     REG_S   t0, SBI_TRAP_REGS_OFFSET(sp)(sp)
> >>> +     srl     t0, t0, (MSTATUS_MPP_SHIFT + 1)
> >>> +     and     t0, t0, 0x1
> >>> +     xor     sp, sp, tp
> >>> +     mul     t0, t0, sp
> >>> +     xor     sp, sp, tp
> >>> +     xor     t0, tp, t0
> >>> +
> >>> +     /* Save original SP on exception stack */
> >>> +     REG_S   sp, (SBI_TRAP_REGS_OFFSET(sp) - SBI_TRAP_REGS_SIZE)(t0)
> >>> +
> >>> +     /* Set SP to exception stack and make room for trap registers */
> >>> +     add     sp, t0, -(SBI_TRAP_REGS_SIZE)
> >>>
> >>>      /* Restore T0 from scratch space */
> >>>      REG_L   t0, SBI_SCRATCH_TMP0_OFFSET(tp)
> >>> --
> >>> 2.25.1
> >>>
> >>>
> >>> --
> >>> opensbi mailing list
> >>> opensbi at lists.infradead.org
> >>> http://lists.infradead.org/mailman/listinfo/opensbi
>



More information about the opensbi mailing list