[PATCH v4 25/30] context_tracking,x86: Defer kernel text patching IPIs
Valentin Schneider
vschneid at redhat.com
Fri Jan 17 01:47:49 PST 2025
On 14/01/25 13:13, Sean Christopherson wrote:
> On Tue, Jan 14, 2025, Valentin Schneider wrote:
>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>> them vs the newly patched instruction. CPUs that are executing in userspace
>> do not need this synchronization to happen immediately, and this is
>> actually harmful interference for NOHZ_FULL CPUs.
>
> ...
>
>> This leaves us with static keys and static calls.
>
> ...
>
>> @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>> * First step: add a int3 trap to the address that will be patched.
>> */
>> for (i = 0; i < nr_entries; i++) {
>> - tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
>> - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
>> + void *addr = text_poke_addr(&tp[i]);
>> +
>> + /*
>> + * There's no safe way to defer IPIs for patching text in
>> + * .noinstr, record whether there is at least one such poke.
>> + */
>> + if (is_kernel_noinstr_text((unsigned long)addr))
>> + cond = NULL;
>
> Maybe pre-check "cond", especially if multiple ranges need to be checked? I.e.
>
> if (cond && is_kernel_noinstr_text(...))
>> +
>> + tp[i].old = *((u8 *)addr);
>> + text_poke(addr, &int3, INT3_INSN_SIZE);
>> }
>>
>> - text_poke_sync();
>> + __text_poke_sync(cond);
>>
>> /*
>> * Second step: update all but the first byte of the patched range.
>
> ...
>
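Folding in your pre-check, the first loop would end up looking something
like this (just an illustrative sketch against the hunk above, not a tested
respin):

	for (i = 0; i < nr_entries; i++) {
		void *addr = text_poke_addr(&tp[i]);

		/*
		 * There's no safe way to defer IPIs for patching text in
		 * .noinstr; only bother checking while deferral is still
		 * an option.
		 */
		if (cond && is_kernel_noinstr_text((unsigned long)addr))
			cond = NULL;

		tp[i].old = *((u8 *)addr);
		text_poke(addr, &int3, INT3_INSN_SIZE);
	}
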
>> +/**
>> + * is_kernel_noinstr_text - checks if the pointer address is located in the
>> + * .noinstr section
>> + *
>> + * @addr: address to check
>> + *
>> + * Returns: true if the address is located in .noinstr, false otherwise.
>> + */
>> +static inline bool is_kernel_noinstr_text(unsigned long addr)
>> +{
>> + return addr >= (unsigned long)__noinstr_text_start &&
>> + addr < (unsigned long)__noinstr_text_end;
>> +}
>
> This doesn't do the right thing for modules, which matters because KVM can be
> built as a module on x86, and because context tracking understands transitions
> to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not
> being in the kernel, and thus will have IPIs deferred. If KVM uses a static key
> or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the
> patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically
> use the wrong static path.
>
AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions,
so such a static key usage should at the very least be caught and warned
about by objtool - at least when KVM isn't built as a module.

I hadn't really thought about noinstr sections for modules; I can get
objtool to warn about a static key that isn't marked as noinstr-safe being
used in e.g. vmx_vcpu_enter_exit() just by feeding it vmx.o:

arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit.isra.0+0x0: dummykey: non-RO static key usage in noinstr

...but that requires removing a lot of code first, because objtool stops
its noinstr checks earlier when it hits functions it doesn't have full
information on, e.g.:

arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit+0x21c: call to __ct_user_enter() leaves .noinstr.text section

__ct_user_enter() *is* noinstr, but you don't get that from just the header
prototype.
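
For context, the sort of pattern that objtool check is meant to catch looks
like the below - entirely contrived, "dummykey" is just a key I defined
locally for testing:

	#include <linux/jump_label.h>

	static DEFINE_STATIC_KEY_FALSE(dummykey);
	static unsigned long dummy_counter;

	static noinstr void dummy_noinstr_func(void)
	{
		/*
		 * Flipping dummykey patches this branch's text, which a CPU
		 * running in noinstr code (or in a guest) could be executing
		 * without having taken the (deferred) sync IPI - hence the
		 * "non-RO static key usage in noinstr" warning.
		 */
		if (static_branch_unlikely(&dummykey))
			dummy_counter++;
	}
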
> I don't expect this to ever cause problems in practice, because patching code in
> KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are
> actively running guest code, would be all kinds of crazy. But I do think we
> should plug the hole.
>
> If this issue is unique to KVM, i.e. is not a generic problem for all modules (I
> assume module code generally isn't allowed in the entry path, even via NMI?), one
> idea would be to let KVM register its noinstr section for text poking.
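
For reference, one shape such a registration could take - purely a
hypothetical sketch, none of these helpers exist and the names are made up -
would be for the module to hand the core kernel its noinstr-like text
bounds, and for the poking code to check those ranges on top of the builtin
.noinstr section:

	/* Hypothetical registration of a module's noinstr-like text range. */
	struct noinstr_text_range {
		unsigned long		start;
		unsigned long		end;
		struct list_head	list;
	};

	static LIST_HEAD(noinstr_text_ranges);
	static DEFINE_MUTEX(noinstr_text_ranges_lock);

	/* Called by e.g. KVM at module init with its noinstr-like bounds. */
	void register_noinstr_text_range(struct noinstr_text_range *range)
	{
		mutex_lock(&noinstr_text_ranges_lock);
		list_add_rcu(&range->list, &noinstr_text_ranges);
		mutex_unlock(&noinstr_text_ranges_lock);
	}

	/* What text_poke_bp_batch() would check instead of just the builtin section. */
	static bool is_noinstr_text(unsigned long addr)
	{
		struct noinstr_text_range *range;
		bool ret = false;

		if (is_kernel_noinstr_text(addr))
			return true;

		rcu_read_lock();
		list_for_each_entry_rcu(range, &noinstr_text_ranges, list) {
			if (addr >= range->start && addr < range->end) {
				ret = true;
				break;
			}
		}
		rcu_read_unlock();

		return ret;
	}

That obviously still needs an unregister path synchronized against in-flight
pokes, but it would keep the deferral logic oblivious to which module owns
the range.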