[PATCH v5 00/25] context_tracking,x86: Defer some IPIs until a user->kernel transition

Wed Apr 30 12:42:28 PDT 2025

On Wed, 30 Apr 2025 11:07:35 -0700
Dave Hansen <dave.hansen at intel.com> wrote:

> On 4/30/25 10:20, Steven Rostedt wrote:
> > On Tue, 29 Apr 2025 09:11:57 -0700
> > Dave Hansen <dave.hansen at intel.com> wrote:
> >   
> >> I don't think we should do this series.  
> > 
> > Could you provide more rationale for your decision?
> 
> I talked about it a bit in here:
> 
> > https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/  

Hmm, that's easily missed. But thanks for linking it.

> 
> But, basically, this series puts a new onus on the entry code: it can't
> touch the vmalloc() area ... except the LDT ... and except the PEBS
> buffers. If anyone touches vmalloc()'d memory (or anything else that
> eventually gets deferred), they crash. They _only_ crash on these
> NOHZ_FULL systems.
> 
> Putting new restrictions on the entry code is really nasty. Let's say a
> new hardware feature showed up that touched vmalloc()'d memory in the
> entry code. Probably, nobody would notice until they got that new
> hardware and tried to do a NOHZ_FULL workload. It might take years to
> uncover, once that hardware was out in the wild.
> 
> I have a substantial number of gray hairs from dealing with corner cases
> in the entry code.
> 
> You _could_ make it more debuggable. Could you make this work for all
> tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be
> inefficient, but would provide good debugging coverage.
> 
> I also mentioned this earlier, but PTI could be leveraged here to ensure
> that the TLB is flushed properly. You could have the rule that anything
> mapped into the user page table can't have a deferred flush and then do
> deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in
> arch-specific assembly, but it's a million times easier to reason about
> because the window where a deferred-flush allocation might bite you is
> so small.
> 
> Look at the syscall code for instance:
> 
> > SYM_CODE_START(entry_SYSCALL_64)
> >         swapgs
> >         movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
> >         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  
> 
> You can _trivially_ audit this and know that swapgs doesn't touch memory
> and that as long as PER_CPU_VAR()s and the process stack don't have
> their mappings munged and flushes deferred that this would be correct.

Hmm, so there is still a path for this?

At least if it added more ways to debug it, and made some other changes to
shrink the places where touching vmalloc()'d memory is dangerous?

> 
> >> If folks want this functionality, they should get a new CPU that can
> >> flush the TLB without IPIs.  
> > 
> > That's a pretty heavy-handed response. I'm not sure that's always a
> > feasible solution.
> > 
> > From my experience in the world, software has always been around to fix the
> > hardware, not the other way around ;-)  
> 
> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> You can go buy the Intel hardware off the shelf today.

Sure, but changing CPUs on machines is not always that feasible either.

-- Steve