[PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching

Andy Chiu andybnac at gmail.com
Wed Sep 11 03:57:17 PDT 2024


On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote:
> Björn Töpel <bjorn at kernel.org> writes:
> 
> > Andy Chiu <andy.chiu at sifive.com> writes:
> >
> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
> >> instruction fetch can break down to 4 byte at a time, it is impossible
> >> to update two instructions without a race. In order to mitigate it, we
> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
> >> patching can change NOP4 to JALR to eable/disable ftrcae from a
> >                                       enable        ftrace
> >
> >> function. This limits the reach of each ftrace entry to +-2KB displacing
> >> from ftrace_caller.
> >>
> >> Starting from the trampoline, we add a level of indirection for it to
> >> reach ftrace caller target. Now, it loads the target address from a
> >> memory location, then perform the jump. This enable the kernel to update
> >> the target atomically.
> >
> > The +-2K limit is for direct calls, right?
> >
> > ...and this I would say breaks DIRECT_CALLS (which should be implemented
> > using call_ops later)?
> 
> Thinking a bit more, and re-reading the series.
> 
> This series is good work, and it's a big improvement for DYNAMIC_FTRACE,
> but
> 
> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
> +{
> +	unsigned long distance, orig_addr;
> +
> +	orig_addr = (unsigned long)&ftrace_caller;
> +	distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
> +	if (distance > JALR_RANGE)
> +		return -EINVAL;
> +
> +	return __ftrace_modify_call(rec->ip, addr, false);
> +}
> +
> 
> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within
> the JALR_RANGE.


Yes, it is very unlikely that a direct trampoline would ever be within
JALR range.

Recently I have been thinking about solutions to address the issue. One
solution is replaying the AUIPC at function entries. The idea has two
sides. First, if a trap would return to the second instruction of a
patchable entry, do sepc -= 4 so the core re-executes the up-to-date
AUIPC. The other side is to fire synchronous IPIs that execute remote
fence.i at the right points in time, so no core can execute a mix of old
and new instructions.
 
Consider replacing instructions at a function's patchable entry with the
following sequence:

Initial state:
--------------
0: AUIPC
4: JALR

Step1:
write(0, "J +8")
fence w,w
send sync local+remote fence.i
------------------------
0: J +8
4: JALR

Step2:
write(4, "JALR'")
fence w,w
send sync local+remote fence.i
------------------------
0: J +8
4: JALR'

Step3:
write(0, "AUIPC'")
fence w,w
send sync local+remote fence.i (to activate the call)
-----------------------
0: AUIPC'
4: JALR'
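
Here is a rough sketch of that update path in C. The helper names are
made up for illustration: patch_insn_write() stands for a 4-byte text
write, and ftrace_sync_icache() for the synchronous local+remote
fence.i IPI:

	static void ftrace_replay_patch(u32 *entry, u32 auipc_new, u32 jalr_new)
	{
		u32 j_plus_8 = 0x0080006f;	/* jal x0, 8: skip the JALR slot */

		/* Step 1: detour around the pair while it is inconsistent */
		patch_insn_write(entry, &j_plus_8, 4);
		smp_wmb();			/* fence w,w */
		ftrace_sync_icache();		/* sync local+remote fence.i */

		/* Step 2: install the new JALR while every core skips it */
		patch_insn_write(entry + 1, &jalr_new, 4);
		smp_wmb();
		ftrace_sync_icache();

		/* Step 3: install the new AUIPC, activating the new call */
		patch_insn_write(entry, &auipc_new, 4);
		smp_wmb();
		ftrace_sync_icache();
	}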

The following execution sequences are acceptable:
- AUIPC, JALR
- J +8, (skipping {JALR | JALR'})
- AUIPC', JALR'

And here are sequences that we want to prevent:
- AUIPC', JALR
- AUIPC, JALR'

The local core never executes a forbidden sequence, because its trap
return path replays the AUIPC whenever it would resume at the JALR.

By listing all possible execution sequences on a remote core, we can
show that the dangerous sequences cannot happen:

Let 'f' be the fence.i from steps 1, 2 and 3, and let each digit be the
offset of the instruction a remote core executes. These are all the
combinations that can be observed at a patch site on a remote core:

fff04 -- updated seq
ff0f4 -- impossible, would be ff0f04, updated seq
ff04f -- impossible, would be ff08f, safe seq
f0ff4 -- impossible, would be f0ff04, updated seq
f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated)
f04ff -- impossible, would be f08ff, safe seq
0fff4 -- impossible, would be 0fff04, updated seq
0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated)
0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated)
04fff -- old seq

After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR')
After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR')
After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR')

Remote cores should never execute (AUIPC', JALR) or (AUIPC, JALR')

To correctly implement the solution, the trap return code must match the
JALR and adjust sepc only for patchable function entries. This works,
although it relies on an undocumented invariant: we use t0 as both the
source and the destination register of the JALR at function entries, and
the compiler never generates a JALR with that register pattern.
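
A minimal sketch of that match at trap return (is_ftrace_jalr_t0() is a
made-up helper; real code would also have to read the text safely and
confirm sepc sits inside a patchable entry):

	#define RV_OPCODE_JALR	0x67
	#define RV_REG_T0	5

	/* Match "jalr t0, imm(t0)", emitted only at patchable entries */
	static bool is_ftrace_jalr_t0(u32 insn)
	{
		return (insn & 0x7f) == RV_OPCODE_JALR &&
		       ((insn >> 12) & 0x7) == 0 &&		/* funct3 == 0 */
		       ((insn >> 7) & 0x1f) == RV_REG_T0 &&	/* rd  == t0 */
		       ((insn >> 15) & 0x1f) == RV_REG_T0;	/* rs1 == t0 */
	}

	/* At trap return: replay the AUIPC instead of resuming at the JALR */
	static void ftrace_replay_auipc(struct pt_regs *regs)
	{
		u32 insn = *(u32 *)regs->epc;

		if (is_ftrace_jalr_t0(insn))
			regs->epc -= 4;
	}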

Another solution is inspired by Zcmt, and perhaps we can optimize it
when the hardware actually supports Zcmt. First, we allocate a page and
divide it into two halves. The first half of the page holds 255 x 8B
destination addresses. Then, starting from offset 2056, the second half
of the page is a series of pairs of 4-byte instructions:

0:	ftrace_tramp_1
8:	ftrace_tramp_2
...
2040:	ftrace_tramp_255
2048:	ftrace_tramp_256 (not used when configured with 255 tramps)
2056:
ld	t1, -2048(t1)
jr	t1
ld	t1, -2048(t1)
jr	t1
...
4088:
ld	t1, -2048(t1)
jr	t1
4096:

It is possible to expand to 511 trampolines by adding a second page
below and starting the load+jr sequence at offset +2040.

When the kernel boots, we point the AUIPC at each patchable entry at
this page and disable the call by setting the second instruction to
NOP4. Then we can effectively enable/disable/modify a call by patching
only the JALR instruction. Most of the current patch set can be reused
to achieve atomic patching. The missing part is allocating and managing
the trampolines for ftrace users.
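
A sketch of filling that page by hand. The instruction encodings are
spelled out as constants, ftrace_stub is only a placeholder default
target, and the page allocation is left out. Note that the pair at
offset 2056 + 8*i reaches the slot at offset 8 + 8*i, so the slot at
offset 0 goes unused in this sketch:

	#define TRAMP_SLOTS		255
	/* ld t1, -2048(t1): opcode LOAD (0x03), funct3 011,
	 * rd = rs1 = t1 (x6), imm = -2048 (0x800 in bits 31:20) */
	#define RV_LD_T1_M2048_T1	0x80033303
	/* jr t1, i.e. jalr x0, 0(t1) */
	#define RV_JR_T1		0x00030067

	static void ftrace_tramp_page_init(void *page)
	{
		u64 *slots = page;		/* 8B target addresses */
		u32 *insns = page + 2056;	/* ld+jr pairs */
		int i;

		/* t1 holds the pair's own address when the ld executes,
		 * so each pair loads its target from 2048 bytes below */
		for (i = 0; i < TRAMP_SLOTS; i++) {
			slots[i + 1] = (u64)&ftrace_stub;
			insns[2 * i] = RV_LD_T1_M2048_T1;
			insns[2 * i + 1] = RV_JR_T1;
		}
	}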

Thanks,
Andy


