[PATCH RFC v2 riscv/for-next 0/5] Enable ftrace with kernel preemption for RISC-V
Andy Chiu
andy.chiu at sifive.com
Wed Mar 20 09:38:00 PDT 2024
On Tue, Mar 19, 2024 at 11:32 PM Evgenii Shatokhin
<e.shatokhin at yadro.com> wrote:
>
> Hi,
>
> On 18.03.2024 18:31, Andy Chiu wrote:
> > Hi Evgenii,
> >
> > Thanks for your help!
>
> You are welcome!
>
> >
> > I just rebased onto 6.8-rc1 and passed the stress-ng + ftrace/nop
> > testing. I will add some random tracers to the testing and do some
> > optimization before sending it out again. Here are a few things that
> > are needed:
> >
> > On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
> > <e.shatokhin at yadro.com> wrote:
> >>
> >> On 21.02.2024 08:27, Andy Chiu wrote:
> >>>
> >>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin at yadro.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 13.09.2022 12:42, Andy Chiu wrote:
> >>>>> This patchset removes dynamic ftrace's dependency on
> >>>>> stop_machine() and makes it compatible with kernel preemption.
> >>>>> Originally, we ran into stack corruptions, or execution of
> >>>>> partially updated instructions, when starting or stopping ftrace
> >>>>> on a fully preemptible kernel configuration. The reason is that
> >>>>> the kernel periodically calls rcu_momentary_dyntick_idle() on
> >>>>> cores waiting for the code-patching core running in ftrace.
> >>>>> Though rcu_momentary_dyntick_idle() itself is marked notrace, it
> >>>>> calls a number of traceable functions if the kernel is configured
> >>>>> as preemptible. For example, these are some functions that happen
> >>>>> to have a symbol and have not been marked notrace in a RISC-V
> >>>>> preemptible kernel compiled with GCC-11:
> >>>>> - __rcu_report_exp_rnp()
> >>>>> - rcu_report_exp_cpu_mult()
> >>>>> - rcu_preempt_deferred_qs()
> >>>>> - rcu_preempt_need_deferred_qs()
> >>>>> - rcu_preempt_deferred_qs_irqrestore()
> >>>>>
> >>>>> Thus, it is not ideal for us to rely on stop_machine() and
> >>>>> hand-marked "notrace"s to perform runtime code patching. To
> >>>>> remove that dependency, we must make code updates appear atomic
> >>>>> to the running cores. This is not straightforward on RISC-V,
> >>>>> since it usually uses an AUIPC + JALR pair to perform a long
> >>>>> jump, and that pair cannot be modified and executed concurrently
> >>>>> once we take preemption into account. As such, this patchset
> >>>>> proposes a way to make it possible: it embeds 32-bit
> >>>>> relative-address data into the instructions of each ftrace
> >>>>> prologue and jumps indirectly. In this way, we can store and load
> >>>>> the address atomically, so the code-patching core can run
> >>>>> simultaneously with the rest of the running cores.
> >>>>>
> >>>>> After applying the patchset, we compiled a preemptible kernel
> >>>>> with all tracers and the ftrace selftests enabled, and booted it
> >>>>> on a 2-core QEMU virt machine. The kernel booted successfully and
> >>>>> passed all ftrace test suites. In addition, we ran a script that
> >>>>> randomly picks a tracer every 0~5 seconds. The kernel has
> >>>>> sustained over 20K rounds of this test. In contrast, a
> >>>>> preemptible kernel without our patches would panic within a few
> >>>>> rounds on the same machine.
> >>>>>
> >>>>> However, we ran into errors when using the hwlat or irqsoff
> >>>>> tracers together with the cpu-online stressor from stress-ng on a
> >>>>> preemptible kernel. We believe the reason may be that the percpu
> >>>>> workers of those tracers are queued into an unbound workqueue
> >>>>> when a CPU gets offlined; fixes for this would go through the
> >>>>> tracing tree.
> >>>>>
> >>>>> Additionally, we found that patching tracepoints is unsafe, since
> >>>>> the instructions being patched are not naturally aligned. During
> >>>>> code patching, this may result in two half-word stores, which
> >>>>> breaks atomicity.
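
A quick illustration of that hazard for readers new to the thread
(hypothetical code, not from the patchset): a 32-bit instruction at an
address that is only 2-byte aligned cannot be rewritten with a single
naturally aligned word store, so a patcher degenerates into something
like:
------------------
/*
 * Illustration only: patching a 32-bit instruction at a 2-byte-aligned
 * address splits into two half-word stores, and another hart may fetch
 * the instruction between the two.
 */
static void unsafe_patch_misaligned(u16 *addr, u32 insn)
{
	addr[0] = insn & 0xffff;	/* low half lands first...        */
	addr[1] = insn >> 16;		/* ...high half later: not atomic */
}
------------------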
> >>>>>
> >>>>> changes in patch v2:
> >>>>> - Enforce alignments on all functions with a compiler workaround.
> >>>>> - Support 64bit addressing for ftrace targets if xlen == 64
> >>>>> - Initialize ftrace target addresses to avoid calling bad address in a
> >>>>> hypothesized case.
> >>>>> - Use LGPTR instead of SZPTR since .align is log-scaled for
> >>>>> mcount-dyn.S
> >>>>> - Require the nop instruction of all jump_labels to align
> >>>>> naturally on 4B.
> >>>>>
> >>>>> Andy Chiu (5):
> >>>>> riscv: align ftrace to 4 Byte boundary and increase ftrace prologue
> >>>>> size
> >>>>> riscv: export patch_insn_write
> >>>>> riscv: ftrace: use indirect jump to work with kernel preemption
> >>>>> riscv: ftrace: do not use stop_machine to update code
> >>>>> riscv: align arch_static_branch function
> >>>>>
> >>>>> arch/riscv/Makefile | 2 +-
> >>>>> arch/riscv/include/asm/ftrace.h | 24 ----
> >>>>> arch/riscv/include/asm/jump_label.h | 2 +
> >>>>> arch/riscv/include/asm/patch.h | 1 +
> >>>>> arch/riscv/kernel/ftrace.c | 179 ++++++++++++++++++++--------
> >>>>> arch/riscv/kernel/mcount-dyn.S | 69 ++++++++---
> >>>>> arch/riscv/kernel/patch.c | 4 +-
> >>>>> 7 files changed, 188 insertions(+), 93 deletions(-)
> >>>>>
> >>>>
> >>>> First of all, thank you for working on making dynamic Ftrace robust in
> >>>> preemptible kernels on RISC-V.
> >>>> It is an important use case but, for now, dynamic Ftrace and related
> >>>> tracers cannot be safely used with such kernels.
> >>>>
> >>>> Are there any updates on this series?
> >>>> It needs a rebase, of course, but it looks doable.
> >>>>
> >>>> If I understand the discussion correctly, the only blocker was
> >>>> that "-falign-functions" was not enough to properly align cold
> >>>> functions, while "-fno-guess-branch-probability" would likely have
> >>>> a performance cost.
> >>>>
> >>>> It seems GCC developers have recently provided a workaround for
> >>>> that
> >>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
> >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
> >>>>
> >>>> "-fmin-function-alignment" should help, but I do not know which
> >>>> GCC versions have that patch already. In the meantime, one could
> >>>> check whether "-fmin-function-alignment" is supported by the
> >>>> compiler and use it if it is.
> >>>>
> >>>> Thoughts?
> >>>
> >>> Hi Evgenii,
> >>>
> >>> Thanks for the update. Indeed, it is essential for this patchset
> >>> that the toolchain provide forced alignment. We can test this flag
> >>> in the Makefile to sort out whether the toolchain supports it.
> >>> Meanwhile, I had figured out a way to make this work with any
> >>> 2-byte-aligned addresses, but hadn't implemented it yet. Basically,
> >>> it would require more patching space so that we can do software
> >>> alignment. I would opt for a dedicated toolchain flag whenever the
> >>> toolchain supports one.
> >>>
> >>> Let me take some time to look and get back to you soon.
> >>
> >> Thank you! Looking forward to it.
> >>
> >> In case it helps, here is what I have checked so far.
> >>
> >> 1.
> >> I added the patch
> >> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
> >> to the current revision of GCC 13.2.0 from RISC-V toolchain.
> >>
> >> Rebased your patchset on top of Linux 6.8-rc4 (mostly context
> >> changes, SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
> >>
> >> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
> >> preemption").
> >>
> >> Switched from -falign-functions=4 to -fmin-function-alignment=4:
> >> ------------------
> >> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
> >> index b33b787c8b07..dcd0adeebaae 100644
> >> --- a/arch/riscv/Makefile
> >> +++ b/arch/riscv/Makefile
> >> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
> >> LDFLAGS_vmlinux += --no-relax
> >> KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
> >> ifeq ($(CONFIG_RISCV_ISA_C),y)
> >> - CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
> >> + CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -fmin-function-alignment=4
> >> else
> >> - CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
> >> + CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
> >> endif
> >> endif
> >>
> >> ------------------
> >>
> >> As far as I can see from objdump, the functions that were not
> >> aligned at a 4-byte boundary with -falign-functions=4 are now
> >> aligned correctly with -fmin-function-alignment=4.
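
To pick the flag automatically, as you suggest, the arch Makefile could
probe the compiler and fall back to the old flag. An untested sketch
using Kbuild's cc-option helper (exact placement in arch/riscv/Makefile
may differ):
------------------
# Prefer -fmin-function-alignment=4 when the compiler supports it
# (recent GCC); otherwise fall back to -falign-functions=4.
ifeq ($(CONFIG_RISCV_ISA_C),y)
  CC_FLAGS_FTRACE := -fpatchable-function-entry=12
else
  CC_FLAGS_FTRACE := -fpatchable-function-entry=6
endif
CC_FLAGS_FTRACE += $(call cc-option,-fmin-function-alignment=4,-falign-functions=4)
------------------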
> >>
> >> 2.
> >> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
> >>
> >> The boottime tests for Ftrace had passed, except the tests for
> >> function_graph. I described the failure and the possible fix here:
> >> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
> >
> > Indeed, this is needed. I am not sure why the ftrace boot-time tests
> > passed for me back then. Thank you for solving it!
> >
> >>
> >> 3.
> >> There were also boot-time warnings about "RCU not on for:
> >> arch_cpu_idle+0x0/0x2c". These are probably not related to your
> >> patchset, but rather to the fact that Ftrace is enabled in a
> >> preemptible kernel, where RCU does different things.
> >>
> >> As a workaround, I disabled tracing of arch_cpu_idle() for now:
> >> ------------------
> >> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> >> index 92922dbd5b5c..6abeecbfc51d 100644
> >> --- a/arch/riscv/kernel/process.c
> >> +++ b/arch/riscv/kernel/process.c
> >> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
> >>
> >> extern asmlinkage void ret_from_fork(void);
> >>
> >> -void arch_cpu_idle(void)
> >> +void noinstr arch_cpu_idle(void)
> >> {
> >> cpu_do_idle();
> >> }
> >>
> >> ------------------
> >>
> >> 4.
> >> Stress-testing revealed an issue though, which I do not understand yet.
> >>
> >> Probably similar to what you did earlier, I ran a script that
> >> switched the current tracer to "function", "function_graph", "nop",
> >> and "blk" every 1-5 seconds. In another shell, "stress-ng --hrtimers
> >> 1" was running.
> >>
> >> The kernel usually crashed within a few minutes, in seemingly random
> >> locations, but often in one of two ways:
> >>
> >> (a) Invalid instruction, because the address of the ftrace_caller
> >> function was somehow written into the body of the traced function
> >> rather than just into the Ftrace prologue.
> >
> > The reason for this is probably that one of your ftrace_*_call sites
> > is not 8-byte aligned.
>
> I thought all locations where the address of a ftrace_caller function
> is written are 8-byte aligned, if the compiler guarantees that the
> start addresses of all functions are 4-byte aligned. Your patchset
> provides 2 kinds of function prologues exactly for that purpose. Am I
> missing something?
Yes, that's true, and it covers the first step of ftrace: the jump into
the ftrace trampoline. The second step is the jump to the actual ftrace
handler function. We have to use an 8-byte-aligned .text address to
store the pointer to that handler, so it can be patched, and loaded,
atomically in dynamic ftrace.
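
In C terms, the scheme relies on the atomicity of naturally aligned
XLEN-wide accesses, roughly like this (a sketch with hypothetical
names, not the actual code in the series):
------------------
/*
 * The prologue loads the handler address from an 8-byte-aligned slot.
 * On RV64, naturally aligned doubleword loads and stores are
 * single-copy atomic, so the patching core can publish a new handler
 * while other harts keep executing the prologue.
 */
struct ftrace_target_slot {
	unsigned long handler;		/* e.g. address of ftrace_caller */
} __aligned(8);

static void publish_handler(struct ftrace_target_slot *slot,
			    unsigned long addr)
{
	WRITE_ONCE(slot->handler, addr);  /* one aligned store, never torn */
}

static unsigned long load_handler(struct ftrace_target_slot *slot)
{
	return READ_ONCE(slot->handler);  /* what the prologue's load sees */
}
------------------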
>
> >
> >>
> >> In the following example, the crash happened at 0xffffffff800d3398. "b0
> >> d7" is actually not part of the code here, but rather the lower bytes of
> >> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
> >
> > It seems there is a bug in patch_insn_write(). I think we should at
> > least disable migration during patch_map() and patch_unmap(). I'd
> > need some time to dig into patch_map(), but since __set_fixmap()
> > only flushes the local TLB, I'd assume it is not safe to be context
> > switched out and migrated while holding the fixmap mapping. Adding
> > preempt_disable() and preempt_enable() around the call to
> > __patch_insn_write() solves the issue.
> >
>
> Interesting.
> Thanks for pointing that out! I never thought the task could migrate
> to a different CPU while patch_insn_write() is running. If it could,
> that would certainly cause such issues. And probably the issues with
> "function_graph" too, if some data (rather than code) were corrupted
> that way.
I found another issue with function_graph in preemptible Vector, though
it is not directly related to function_graph. Currently we don't
support calling schedule() within kernel_vector_{begin,end}(). However,
this may be unavoidable with ftrace + preemption. For example, a
preemptible vectorized uaccess could call into return_to_handler and
then call schedule() after returning from kernel_vector_begin(). This
can cause the following Vector operation to fail with an illegal
instruction, because VS was turned off during the context switch:
	kernel_vector_begin();
	//=> return_to_handler
	//==> ... schedule()
	remain = __asm_vector_usercopy(dst, src, n);
	kernel_vector_end();
Here is what we could do to support calling schedule() while in an
active preempt_v: when a task is switched out with Vector still on,
disable V and mark the task; when it is switched back in, re-enable V
instead of setting up a restore. A sketch:
static __always_inline void __vstate_csr_save(struct __riscv_v_ext_state *dest)
{
	asm volatile (
@@ -243,6 +248,11 @@ static inline void __switch_to_vector(struct task_struct *prev,
 	struct pt_regs *regs;

 	if (riscv_preempt_v_started(prev)) {
+		if (riscv_v_is_on()) {
+			WARN_ON(prev->thread.riscv_v_flags & RISCV_V_CTX_DEPTH_MASK);
+			riscv_v_disable();
+			prev->thread.riscv_v_flags |= RISCV_PREEMPT_V_IN_SCHEDULE;
+		}
 		if (riscv_preempt_v_dirty(prev)) {
 			__riscv_v_vstate_save(&prev->thread.kernel_vstate,
 					      prev->thread.kernel_vstate.datap);
@@ -253,10 +263,16 @@ static inline void __switch_to_vector(struct task_struct *prev,
 		riscv_v_vstate_save(&prev->thread.vstate, regs);
 	}

-	if (riscv_preempt_v_started(next))
-		riscv_preempt_v_set_restore(next);
-	else
+	if (riscv_preempt_v_started(next)) {
+		if (next->thread.riscv_v_flags & RISCV_PREEMPT_V_IN_SCHEDULE) {
+			next->thread.riscv_v_flags &= ~RISCV_PREEMPT_V_IN_SCHEDULE;
+			riscv_v_enable();
+		} else {
+			riscv_preempt_v_set_restore(next);
+		}
+	} else {
 		riscv_v_vstate_set_restore(next, task_pt_regs(next));
+	}
 }
>
> >>
> >> (gdb) disas /r 0xffffffff800d3382,+0x20
> >> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
> >> ...
> >> 0xffffffff800d3394 <clockevents_program_event+144>:	ba 87	mv	a5,a4
> >> 0xffffffff800d3396 <clockevents_program_event+146>:	c1 bf	j	0xffffffff800d3366 <clockevents_program_event+98>
> >> 0xffffffff800d3398 <clockevents_program_event+148>:	b0 d7	sw	a2,104(a5)	// 0xffffffff8000d7b0, the address of ftrace_caller().
> >> 0xffffffff800d339a <clockevents_program_event+150>:	00 80	.2byte	0x8000
> >> 0xffffffff800d339c <clockevents_program_event+152>:	ff ff	.2byte	0xffff
> >> 0xffffffff800d339e <clockevents_program_event+154>:	ff ff	.2byte	0xffff
> >> 0xffffffff800d33a0 <clockevents_program_event+156>:	d5 bf	j	0xffffffff800d3394 <clockevents_program_event+144>
> >>
> >> The backtrace usually contains one or more occurrences of
> >> return_to_handler() in this case.
> >>
> >> [ 260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
> >> [ 260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> >> [ 260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
> >> [ 260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> >> [ 260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
> >> [ 260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> >> [ 260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> >> [ 260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
> >> [ 260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
> >> ----------------------
> >>
> >> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
> >> instruction. %ra usually points right after the last instruction, "jalr
> >> a2", in return_to_handler() in such cases, so the jump was likely
> >> made from there.
> >
> > I haven't done fgraph tests yet. I will try it out and see.
With the above fixed, I can now pass several hundred rounds (and
counting) of the random-tracer + stress-ng --hrtimers test.
> >
> >>
> >> The problem is reproducible, although I have not found what causes it yet.
> >>
> >> Any help is appreciated, of course.
> >>
> >>>
> >>>>
> >>>> Regards,
> >>>> Evgenii
> >>>
> >>> Regards,
> >>> Andy
> >>
> >
> > Also, here is another side note:
> >
> > It seems the ftrace save/restore routine should save more registers,
> > as clang's fastcc may use t2 when the number of arguments exceeds
> > what the ABI allows to be passed in registers.
>
> Yes, I reported that issue to LLVM maintainers in
> https://github.com/llvm/llvm-project/issues/83111. It seems static
> functions with 9+ arguments use t2, t3, etc. for the 9th and 10th
> arguments when compiled with clang.
>
> Clang seems to leave t0 and t1 alone, but I do not know yet whether
> that is just a coincidence. I haven't found the exact rules for the
> fastcc calling convention on RISC-V so far.
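
For anyone who wants to reproduce this, a shape like the following
hypothetical function (not from the bug report) should show the extra
temporaries once clang promotes a static function to fastcc:
------------------
/*
 * The standard RISC-V calling convention passes a0-a7 in registers and
 * further integer arguments on the stack; clang's internal fastcc may
 * instead pass the 9th and 10th arguments in t2 and t3, which the
 * ftrace save/restore code would then have to preserve.
 */
static __attribute__((noinline)) int sum10(int a0, int a1, int a2,
					   int a3, int a4, int a5,
					   int a6, int a7, int a8, int a9)
{
	return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9;
}
------------------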
>
> A compiler option to disable fastcc for Linux kernel builds would be
> great. But it seems the discussion with LLVM maintainers will go
> nowhere without benchmarks showing whether that optimization has any
> significant effect. I plan to find and run proper benchmarks when I
> have time, but not just yet.
>
> >
> > Cheers,
> > Andy
>
> Regards,
> Evgenii
>
>