[PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state

Eric W. Biederman ebiederm at xmission.com
Sat Jun 25 09:34:46 PDT 2022


Alexander Gordeev <agordeev at linux.ibm.com> writes:

> On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
>> Alexander Gordeev <agordeev at linux.ibm.com> writes:
>> 
>> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
>> >> From: Peter Zijlstra <peterz at infradead.org>
>> >> 
>> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
>> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
>> >> state exists only in task->__state and nowhere else.
>> >> 
>> >> There's two spots of bother with this:
>> >> 
>> >>  - PREEMPT_RT has task->saved_state which complicates matters,
>> >>    meaning task_is_{traced,stopped}() needs to check an additional
>> >>    variable.
>> >> 
>> >>  - An alternative freezer implementation that itself relies on a
>> >>    special TASK state would loose TASK_TRACED/TASK_STOPPED and will
>> >>    result in misbehaviour.
>> >> 
>> >> As such, add additional state to task->jobctl to track this state
>> >> outside of task->__state.
>> >> 
>> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
>> >> 
>> >> --EWB
>> >>   * didn't add a unnecessary newline in signal.h
>> >>   * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
>> >>     instead of in signal_wake_up_state.  This prevents the clearing
>> >>     of TASK_STOPPED and TASK_TRACED from getting lost.
>> >>   * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
>> >
>> > Hi Eric, Peter,
>> >
>> > On s390 this patch triggers warning at kernel/ptrace.c:272 when
>> > kill_child testcase from strace tool is repeatedly used (the source
>> > is attached for reference):
>> >
>> > while :; do
>> > 	strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
>> > done
>> >
>> > It normally takes few minutes to cause the warning in -rc3, but FWIW
>> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
>> >
>> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
>> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
>> > observe a child in __TASK_TRACED state. Could you please comment here?
>> >
>> 
>> For clarity the warning is that the child is not in __TASK_TRACED state.
>> 
>> The code is waiting for the code to stop in the scheduler in the
>> __TASK_TRACED state so that it can safely read and change the
>> processes state.  Some of that state is not even saved until the
>> process is scheduled out so we have to wait until the process
>> is stopped in the scheduler.
>
> So I assume (checked actually) the return 0 below from kernel/sched/core.c:
> wait_task_inactive() is where it bails out:
>
> 3303                 while (task_running(rq, p)) {
> 3304                         if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> 3305                                 return 0;
> 3306                         cpu_relax();
> 3307                 }
>
> Yet, the child task is always found in __TASK_TRACED state (as seen
> in crash dumps):
>
>> 101447  11342  13      ce3a8100      RU   0.0   10040   4412  strace
>   101450  101447   0      bb04b200      TR   0.0    2272   1136  kill_child
>   108261  101447   2      d0b10100      TR   0.0    2272    532  kill_child
> crash> task bb04b200 __state
> PID: 101450  TASK: bb04b200          CPU: 0   COMMAND: "kill_child"
>   __state = 8,
>
> crash> task d0b10100 __state
> PID: 108261  TASK: d0b10100          CPU: 2   COMMAND: "kill_child"
>   __state = 8,
>

I haven't gotten as far as reproducing this but I have started giving
this issue some thought.

This entire thing smells like a memory barrier is missing somewhere.
However by definition the lock implementations in linux provide all the
needed memory barriers, and in the ptrace_stop and ptrace_check_attach
path I don't see cases where these values are sampled outside of a lock
except in wait_task_inactive.  Does doing that perhaps require a
barrier? 

The two things I can think of that could shed light on what is going on
is enabling lockdep, to enable the debug check in signal_wake_up_state
and verifying bits of state that should be constant while the task
is frozen for ptrace are indeed constant when task is frozen for ptrace.
Something like my patch below.

If you could test that when you have a chance that would help narrow
down what is going on.

Thank you,
Eric

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 156a99283b11..6467a2b1c3bc 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -268,9 +268,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state &&
-	    WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
+	if (!ret && !ignore_state) {
+		WARN_ON_ONCE(!(child->jobctl & JOBCTL_PTRACE_FROZEN));
+		WARN_ON_ONCE(!(child->joctctl & JOBCTL_TRACED));
+		WARN_ON_ONCE(READ_ONCE(child->__state) != __TASK_TRACED);
+		WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED));
 		ret = -ESRCH;
+	}
 
 	return ret;
 }



More information about the linux-um mailing list