TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)

Frederic Weisbecker fweisbec at gmail.com
Thu Jul 31 11:47:31 PDT 2014


On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Frederic Weisbecker wrote:
> > > >
> > > > I was convinced that the very first kernel init task is PID 0 then
> > > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > > idle task of the boot CPU.
> > >
> > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > by "swapper".
> >
> > Are you sure? It's called from start_kernel() which is init/0.
> 
> But do_initcalls() is called by kernel_init(), this is the init process which is
> going to exec /sbin/init later.
> 
> But this doesn't really matter,

Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.

> 
> > Maybe we should just enable it everywhere.
> 
> Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> of start_kernel(). The question is, I still can't understand why do we want to
> have the global TIF_NOHZ.

Because then the flag is inherited across forks. That's better than setting it at
context switch time, since context switches happen much more often than forks.

> 
> > > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > > CPU_0, and CPU_1 can miss it anyway.
> > >
> > > > But global TIF_NOHZ should enforce context tracking everywhere.
> > >
> > > And this is what I can't understand. Lets return to my initial question, why
> > > we can't change __context_tracking_task_switch()
> > >
> > > 	void __context_tracking_task_switch(struct task_struct *prev,
> > > 					    struct task_struct *next)
> > > 	{
> > > 		if (context_tracking_cpu_is_enabled())
> > > 			set_tsk_thread_flag(next, TIF_NOHZ);
> > > 		else
> > > 			clear_tsk_thread_flag(next, TIF_NOHZ);
> > > 	}
> > >
> > > ?
> >
> > Well we can change it to global TIF_NOHZ
> >
> > > How can the global TIF_NOHZ help?
> >
> > It avoids that flag swap on task_switch.
> 
> Ah, you probably meant that we can kill context_tracking_task_switch() as
> we discussed.

Yeah.

> 
> But I meant another thing, TIF_NOHZ is already global because it is always
> set after context_tracking_cpu_set().
> 
> Performance-wise, this set/clear code above can be better because it avoids
> the slow paths on the untracked CPU's.

But all CPUs are tracked when context tracking is enabled, so there is no such
per-CPU granularity.

> But let's ignore this, the question is
> why the change above is not correct? Or why it can make the things worse?

Which change above?

> 
> 
> > > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > > comes to user_enter(). But this case is very unlikely, certainly this can't
> > > explain why do we penalize the untracked CPU's ?
> >
> > It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
> > to userspace.
> 
> And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
> the necessary user_enter().

No, user_enter() is necessary on CPU 0 so that CPU 1 sees the task as being in
userspace from the context tracking POV.

> 
> > It's unlikely but possible. I actually observed that very easily on early testing.
> 
> Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
> is wrong?

Which change?

> 
> > And it's a big problem because then the CPU runs in userspace, possibly for a long while
> > in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> > for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> > userspace.
> >
> > It looks like a harmless race but it can have big consequences.
> 
> I see. Again, does the global TIF_NOHZ really help?

Yes, it removes the context switch overhead. But it doesn't change anything about those races.

> > Because calling context_tracking_task_switch() on every context switch is costly.
> 
> See above. This depends, but forcing the slow paths on all CPU's can be more
> costly.

Yeah but I don't have a safe solution that avoids that.

> 
> > > And of course I can't understand exception_enter/exit(). Not to mention that
> > > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > > if we forget that the caller can migrate in between. Just because, once again,
> > > a tracked CPU can miss user_exit().
> >
> > You lost me on this. How can a tracked CPU miss user_exit()?
> 
> I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
> CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
> space leaving this cpu in IN_KERNEL state?

Yeah, so? :)

I'm pretty sure there is a small but important element here that makes us misunderstand
what each other is saying. It's like we aren't talking about the same thing :)

> > That's how I implemented it first. But then I changed it the way it is now:
> > 6c1e0256fad84a843d915414e4b5973b7443d48d
> > ("context_tracking: Restore correct previous context state on exception exit")
> >
> > This is again due to the shift between hard and soft userspace boundaries.
> > user_mode(regs) checks hard boundaries only.
> >
> > Lets get back to our beloved example:
> >
> >           CPU 0                                  CPU 1
> >           ---------------------------------------------
> >
> >           returning from syscall {
> >                user_enter();
> >                exception {
> >                     exception_enter()
> >                     PREEMPT!
> >                     ----------------------->
> >                                                  //resume exception
> >                                                    exception_exit();
> >                                                    return to userspace
> 
> OK, thanks, so in this case we miss user_enter().
> 
> But again, we can miss it anyway if preemptions comes before "exception" ?
> if CPU 1 was in IN_KERNEL state.

No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.


