[PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state

Eric W. Biederman ebiederm at xmission.com
Tue Jun 21 10:47:30 PDT 2022


Alexander Gordeev <agordeev at linux.ibm.com> writes:

> On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
>> Alexander Gordeev <agordeev at linux.ibm.com> writes:
>> 
>> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
>> >> From: Peter Zijlstra <peterz at infradead.org>
>> >> 
>> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
>> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
>> >> state exists only in task->__state and nowhere else.
>> >> 
>> >> There's two spots of bother with this:
>> >> 
>> >>  - PREEMPT_RT has task->saved_state which complicates matters,
>> >>    meaning task_is_{traced,stopped}() needs to check an additional
>> >>    variable.
>> >> 
>> >>  - An alternative freezer implementation that itself relies on a
>> >>    special TASK state would loose TASK_TRACED/TASK_STOPPED and will
>> >>    result in misbehaviour.
>> >> 
>> >> As such, add additional state to task->jobctl to track this state
>> >> outside of task->__state.
>> >> 
>> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
>> >> 
>> >> --EWB
>> >>   * didn't add a unnecessary newline in signal.h
>> >>   * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
>> >>     instead of in signal_wake_up_state.  This prevents the clearing
>> >>     of TASK_STOPPED and TASK_TRACED from getting lost.
>> >>   * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
>> >
>> > Hi Eric, Peter,
>> >
>> > On s390 this patch triggers warning at kernel/ptrace.c:272 when
>> > kill_child testcase from strace tool is repeatedly used (the source
>> > is attached for reference):
>> >
>> > while :; do
>> > 	strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
>> > done
>> >
>> > It normally takes few minutes to cause the warning in -rc3, but FWIW
>> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
>> >
>> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
>> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
>> > observe a child in __TASK_TRACED state. Could you please comment here?
>> >
>> 
>> For clarity the warning is that the child is not in __TASK_TRACED state.
>> 
>> The code is waiting for the code to stop in the scheduler in the
>> __TASK_TRACED state so that it can safely read and change the
>> processes state.  Some of that state is not even saved until the
>> process is scheduled out so we have to wait until the process
>> is stopped in the scheduler.
>
> So I assume (checked actually) the return 0 below from kernel/sched/core.c:
> wait_task_inactive() is where it bails out:
>
> 3303                 while (task_running(rq, p)) {
> 3304                         if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> 3305                                 return 0;
> 3306                         cpu_relax();
> 3307                 }
>
> Yet, the child task is always found in __TASK_TRACED state (as seen
> in crash dumps):
>
>> 101447  11342  13      ce3a8100      RU   0.0   10040   4412  strace
>   101450  101447   0      bb04b200      TR   0.0    2272   1136  kill_child
>   108261  101447   2      d0b10100      TR   0.0    2272    532  kill_child
> crash> task bb04b200 __state
> PID: 101450  TASK: bb04b200          CPU: 0   COMMAND: "kill_child"
>   __state = 8,
>
> crash> task d0b10100 __state
> PID: 108261  TASK: d0b10100          CPU: 2   COMMAND: "kill_child"
>   __state = 8,

That is weird.

>> At least on s390 it looks like there is a race between SIGKILL and
>> ptrace_check_attach.  That isn't good.
>>
>> Reading the code below there is something missing because I don't see
>> anything making ptrace calls, and ptrace_check_attach (which contains
>> the warning) only happens in the ptrace syscall.
>
> That is what I believe strace does when calling that code:
>
>  	strace -f -qq -e signal=none -e trace=sched_yield,/kill	./kill_child

Thank you.  That was my braino.

I will have to see if it reproduces for me on x86 (I don't have an
s390).  Perhaps if I can reproduce it I can guess what is going wrong.

So far it appears WARN_ON_ONCE has nothing to warn about yet it is
warning.

Eric



More information about the linux-um mailing list