[PATCH RFC v2 riscv/for-next 0/5] Enable ftrace with kernel preemption for RISC-V

Thu Mar 21 04:02:20 PDT 2024

On 20/03/2024 17:36, Andy Chiu wrote:
> On Wed, Mar 20, 2024 at 1:37 AM Alexandre Ghiti <alex at ghiti.fr> wrote:
>> Hi Andy,
>>
>> On 18/03/2024 16:31, Andy Chiu wrote:
>>> Hi Evgenii,
>>>
>>> Thanks for your help!
>>>
>>> I just rebased upon 6.8-rc1 and passed the stress-ng + ftrace/nop
>>> testing. I will add some random tracers to test and some optimization
>>> before sending out again. Here are a few things needed:
>>>
>>> On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
>>> <e.shatokhin at yadro.com> wrote:
>>>> On 21.02.2024 08:27, Andy Chiu wrote:
>>>>> «Внимание! Данное письмо от внешнего адресата!»
>>>>>
>>>>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin at yadro.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 13.09.2022 12:42, Andy Chiu wrote:
>>>>>>> This patch removes dependency of dynamic ftrace from calling
>>>>>>> stop_machine(), and makes it compatiable with kernel preemption.
>>>>>>> Originally, we ran into stack corruptions, or execution of partially
>>>>>>> updated instructions when starting or stopping ftrace on a fully
>>>>>>> preemptible kernel configuration. The reason is that kernel periodically
>>>>>>> calls rcu_momentary_dyntick_idle() on cores waiting for the code-patching
>>>>>>> core running in ftrace. Though rcu_momentary_dyntick_idle() itself is
>>>>>>> marked as notrace, it would call a bunch of tracable functions if we
>>>>>>> configured the kernel as preemptible. For example, these are some functions
>>>>>>> that happened to have a symbol and have not been marked as notrace on a
>>>>>>> RISC-V preemptible kernel compiled with GCC-11:
>>>>>>>      - __rcu_report_exp_rnp()
>>>>>>>      - rcu_report_exp_cpu_mult()
>>>>>>>      - rcu_preempt_deferred_qs()
>>>>>>>      - rcu_preempt_need_deferred_qs()
>>>>>>>      - rcu_preempt_deferred_qs_irqrestore()
>>>>>>>
>>>>>>> Thus, this make it not ideal for us to rely on stop_machine() and
>>>>>>> handly marked "notrace"s to perform runtime code patching. To remove
>>>>>>> such dependency, we must make updates of code seemed atomic on running
>>>>>>> cores. This might not be obvious for RISC-V since it usaually uses a pair
>>>>>>> of AUIPC + JALR to perform a long jump, which cannot be modified and
>>>>>>> executed concurrently if we consider preemptions. As such, this patch
>>>>>>> proposed a way to make it possible. It embeds a 32-bit rel-address data
>>>>>>> into instructions of each ftrace prologue and jumps indirectly. In this
>>>>>>> way, we could store and load the address atomically so that the code
>>>>>>> patching core could run simutaneously with the rest of running cores.
>>>>>>>
>>>>>>> After applying the patchset, we compiled a preemptible kernel with all
>>>>>>> tracers and ftrace-selftest enabled, and booted it on a 2-core QEMU virt
>>>>>>> machine. The kernel could boot up successfully, passing all ftrace
>>>>>>> testsuits. Besides, we ran a script that randomly pick a tracer on every
>>>>>>> 0~5 seconds. The kernel has sustained over 20K rounds of the test. In
>>>>>>> contrast, a preemptible kernel without our patch would panic in few
>>>>>>> rounds on the same machine.
>>>>>>>
>>>>>>> Though we ran into errors when using hwlat or irqsoff tracers together
>>>>>>> with cpu-online stressor from stress-ng on a preemptible kernel. We
>>>>>>> believe the reason may be that  percpu workers of the tracers are being
>>>>>>> queued into unbounded workqueue when cpu get offlined and patches will go
>>>>>>> through tracing tree.
>>>>>>>
>>>>>>> Additionally, we found patching of tracepoints unsafe since the
>>>>>>> instructions being patched are not naturally aligned. This may result in
>>>>>>> 2 half-word stores, which breaks atomicity, during the code patching.
>>>>>>>
>>>>>>> changes in patch v2:
>>>>>>>      - Enforce alignments on all functions with a compiler workaround.
>>>>>>>      - Support 64bit addressing for ftrace targets if xlen == 64
>>>>>>>      - Initialize ftrace target addresses to avoid calling bad address in a
>>>>>>>        hypothesized case.
>>>>>>>      - Use LGPTR instead of SZPTR since .align is log-scaled for
>>>>>>>        mcount-dyn.S
>>>>>>>      - Require the nop instruction of all jump_labels aligns naturally on
>>>>>>>        4B.
>>>>>>>
>>>>>>> Andy Chiu (5):
>>>>>>>       riscv: align ftrace to 4 Byte boundary and increase ftrace prologue
>>>>>>>         size
>>>>>>>       riscv: export patch_insn_write
>>>>>>>       riscv: ftrace: use indirect jump to work with kernel preemption
>>>>>>>       riscv: ftrace: do not use stop_machine to update code
>>>>>>>       riscv: align arch_static_branch function
>>>>>>>
>>>>>>>      arch/riscv/Makefile                 |   2 +-
>>>>>>>      arch/riscv/include/asm/ftrace.h     |  24 ----
>>>>>>>      arch/riscv/include/asm/jump_label.h |   2 +
>>>>>>>      arch/riscv/include/asm/patch.h      |   1 +
>>>>>>>      arch/riscv/kernel/ftrace.c          | 179 ++++++++++++++++++++--------
>>>>>>>      arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
>>>>>>>      arch/riscv/kernel/patch.c           |   4 +-
>>>>>>>      7 files changed, 188 insertions(+), 93 deletions(-)
>>>>>>>
>>>>>> First of all, thank you for working on making dynamic Ftrace robust in
>>>>>> preemptible kernels on RISC-V.
>>>>>> It is an important use case but, for now, dynamic Ftrace and related
>>>>>> tracers cannot be safely used with such kernels.
>>>>>>
>>>>>> Are there any updates on this series?
>>>>>> It needs a rebase, of course, but it looks doable.
>>>>>>
>>>>>> If I understand the discussion correctly, the only blocker was that
>>>>>> using "-falign-functions" was not enough to properly align cold
>>>>>> functions and "-fno-guess-branch-probability" would likely have a
>>>>>> performance cost.
>>>>>>
>>>>>> It seems, GCC developers have recently provided a workaround for that
>>>>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>>>>>>
>>>>>> "-fmin-function-alignment" should help but, I do not know, which GCC
>>>>>> versions have got that patch already. In the meantime, one could
>>>>>> probably check if "-fmin-function-alignment" is supported by the
>>>>>> compiler and use it, if it is.
>>>>>>
>>>>>> Thoughts?
>>>>> Hi Evgenii,
>>>>>
>>>>> Thanks for the update. Indeed, it is essential to this patch for
>>>>> toolchain to provide forced alignment. We can test this flag in the
>>>>> Makefile to sort out if toolchain supports it or not. Meanwhile, I had
>>>>> figured out a way for this to work on any 2-B align addresses but
>>>>> hadn't implemented it out yet. Basically it would require more
>>>>> patching space for us to do software alignment. I would opt for a
>>>>> special toolchain flag if the toolchain just supports it.
>>>>>
>>>>> Let me take some time to look and get back to you soon.
>>>> Thank you! Looking forward to it.
>>>>
>>>> In case it helps, here is what I have checked so far.
>>>>
>>>> 1.
>>>> I added the patch
>>>> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>>>> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>>>>
>>>> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context changes,
>>>> SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>>>>
>>>> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
>>>> preemption").
>>>>
>>>> Switched from -falign-functions=4 to -fmin-function-alignment=4:
>>>> ------------------
>>>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>>>> index b33b787c8b07..dcd0adeebaae 100644
>>>> --- a/arch/riscv/Makefile
>>>> +++ b/arch/riscv/Makefile
>>>> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>>>           LDFLAGS_vmlinux += --no-relax
>>>>           KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>>>     ifeq ($(CONFIG_RISCV_ISA_C),y)
>>>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
>>>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=12
>>>> -fmin-function-alignment=4
>>>>     else
>>>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
>>>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
>>>>     endif
>>>>     endif
>>>>
>>>> ------------------
>>>>
>>>> As far as I can see from objdump, the functions that were not aligned at
>>>> 4-byte boundary with -falign-functions=4, are now aligned correctly with
>>>> -fmin-function-alignment=4.
>>>>
>>>> 2.
>>>> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>>>>
>>>> The boottime tests for Ftrace had passed, except the tests for
>>>> function_graph. I described the failure and the possible fix here:
>>>> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
>>> Indeed, this is needed. I am not sure why I got ftrace boot-time tests
>>> passed back then. Thank you for solving it!
>>>
>>>> 3.
>>>> There were also boottime warnings about "RCU not on for:
>>>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>>>> patchset, but rather to the fact that Ftrace is enabled in a preemptble
>>>> kernel where RCU does different things.
>>>>
>>>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>>>> ------------------
>>>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>>>> index 92922dbd5b5c..6abeecbfc51d 100644
>>>> --- a/arch/riscv/kernel/process.c
>>>> +++ b/arch/riscv/kernel/process.c
>>>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>>>
>>>>     extern asmlinkage void ret_from_fork(void);
>>>>
>>>> -void arch_cpu_idle(void)
>>>> +void noinstr arch_cpu_idle(void)
>>>>     {
>>>>           cpu_do_idle();
>>>>     }
>>>>
>>>> ------------------
>>>>
>>>> 4.
>>>> Stress-testing revealed an issue though, which I do not understand yet.
>>>>
>>>> Probably similar to what you did earlier, I ran a script that switched
>>>> the current tracer to "function", "function_graph", "nop", "blk" each
>>>> 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
>>>>
>>>> The kernel usually crashed within a few minutes, in seemingly random
>>>> locations, but often in one of two ways:
>>>>
>>>> (a) Invalid instruction, because the address of ftrace_caller function
>>>> was somehow written to the body of the traced function rather than just
>>>> to the Ftrace prologue.
>>> The reason for this is probably that any one of your ftrace_*_call is
>>> not 8-B aligned.
>>>
>>>> In the following example, the crash happened at 0xffffffff800d3398. "b0
>>>> d7" is actually not part of the code here, but rather the lower bytes of
>>>> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
>>> It seems like there is a bug in patch_insn_write(). I think we should
>>> at least disable migration during patch_map() and patch_unmap(). I'd
>>> need some time to dig into patch_map(). But since __set_fixmap() only
>>> flush local tlb, I'd assume it is not safe to context switch out and
>>> migrate while holding the fix-map mapping. Adding preempt_disable()
>>> and preempt_enable() before calling __patch_insn_write() solves the
>>> issue.
>>
>> Yes, Andrea already mentioned this, I came up with the same idea of
>> preempt_disable() but then I noticed arm64 actually disables IRQ: any
>> idea why?
>> https://lore.kernel.org/linux-riscv/CAHVXubj7ChgpvN4F_QO0oASaT5WC2VS0Q-bEqhnmF8z8QV=yDQ@mail.gmail.com/
> Hi, I took a quick look and it seems that it is a design choice in
> software to me. ARM uses a spinlock to protect text and we use a
> mutex. If they have a requirement to do patching while irq is off
> (maybe in an ipi handler), then the only viable option would be to use
> raw_spin_lock_irqsave. I think preempt_disable should be enough for us
> if we use text_mutex to protect patching. Or, am I missing something?

I agree with you, I convinced myself that it should be enough :)

Do you intend to send this patch? Or should I? I have another small fix 
for ftrace, so I don't mind sending this one. Up to you, we just need to 
make sure it lands in 6.9 :)

Thanks

>
>
>
>
>>
>>>> (gdb) disas /r 0xffffffff800d3382,+0x20
>>>> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
>>>> ...
>>>>       0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv
>>>> a5,a4
>>>>       0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j
>>>> 0xffffffff800d3366 <clockevents_program_event+98>
>>>>       0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw
>>>> a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
>>>>       0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte
>>>> 0x8000
>>>>       0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte
>>>> 0xffff
>>>>       0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte
>>>> 0xffff
>>>>       0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j
>>>> 0xffffffff800d3394 <clockevents_program_event+144
>>>>
>>>> The backtrace usually contains one or more occurrences of
>>>> return_to_handler() in this case.
>>>>
>>>> [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
>>>> [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
>>>> [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
>>>> [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
>>>> [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
>>>> ----------------------
>>>>
>>>> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
>>>> instruction. %ra usually points right after the last instruction, "jalr
>>>>      a2", in return_to_handler() in such cases, so the jump was likely
>>>> made from there.
>>> I haven't done fgraph tests yet. I will try out and see.
>>>
>>>> The problem is reproducible, although I have not found what causes it yet.
>>>>
>>>> Any help is appreciated, of course.
>>>>
>>>>>> Regards,
>>>>>> Evgenii
>>>>> Regards,
>>>>> Andy
>>> Also, here is another side note,
>>>
>>> It seems like the ftrace save/restore routine should save more
>>> registers as clang's fastcc may use t2 when the number of arguments
>>> exceeds what ABI defines for passing arg through registers.
>>>
>>> Cheers,
>>> Andy
>>>
>>> _______________________________________________
>>> linux-riscv mailing list
>>> linux-riscv at lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-riscv