[PATCH] sched/rt: fix incorrect schedstats for rt thread

Thu Jan 8 23:24:47 PST 2026

On Thu, 2026-01-08 at 12:16 +0100, Peter Zijlstra wrote:
> On Thu, Jan 08, 2026 at 11:13:07AM +0800, Dengjun Su wrote:
> > For RT thread, only 'set_next_task_rt' will call
> > 'update_stats_wait_end_rt' to update schedstats information.
> > However, during the RT migration process,
> > 'update_stats_wait_start_rt' will be called twice, which
> > will cause the values of wait_max and wait_sum to be incorrect.
> 
> Right, that loses time. Also note that I think dl has the same
> issue.

Hi Peter,

Thanks for the feedback. Yes, sorry for missing the dl class;
I will update it in V2.

> 
> > The specific output is as follows:
> > $ cat /proc/6046/task/6046/sched | grep wait
> > wait_start                                   :             0.000000
> > wait_max                                     :        496717.080029
> > wait_sum                                     :       7921540.776553
> > 
> > Add 'update_stats_wait_end_rt' in 'update_stats_dequeue_rt' to
> > update schedstats information when dequeue_task.
> 
> This needs a few more words on why this is correct -- notably it took
> me a little time to find the 'task_on_rq_migrating()' case in
> __update_stats_wait_end() which makes this not actually 'end'.
> 
> But then the corresponding clause in __update_stats_wait_start()
> gives me a headache:
> 
>  'wait_start > prev_wait_start'
> 
> I mean, wtf. Should that not equally be using task_on_rq_migrating()?
> 
> Can you please take a hard look at all that and fix up things
> all-round?
> 

For a migrating task, the complete schedstats update flow should be:
__update_stats_wait_start() [enter queue A, stage 1] ->
__update_stats_wait_end()   [leave queue A, stage 2] ->
__update_stats_wait_start() [enter queue B, stage 3] ->
__update_stats_wait_end()   [start running on queue B, stage 4]

    Stage 1: prev_wait_start is 0, so wait_start ends up recording the
    time the task entered queue A.
    Stage 2: task_on_rq_migrating(p) is true, so instead of accounting the
    wait, wait_start is set to the time spent waiting on queue A.
    Stage 3: prev_wait_start holds the waiting time on queue A, and the
    current rq clock (the time of entering queue B) is expected to be
    greater than it. wait_start is therefore set to
    (time of entering queue B) - (waiting time on queue A).
    Stage 4: the final wait time is
    (time of starting to run on queue B) - wait_start
    = (time of starting to run on queue B) - (time of entering queue B)
      + (waiting time on queue A)
    = (waiting time on queue B) + (waiting time on queue A).

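The current problem is that, for RT tasks, stage 2 never happens: the
dequeue path does not call __update_stats_wait_end(), so wait_start
still holds the time of entering queue A when stage 3 runs. The final
computed wait time then becomes (waiting time on queue B) + (the time of
entering queue A), an absolute clock value rather than a duration, which
makes wait_max and wait_sum incorrect.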

For __update_stats_wait_end(), the task_on_rq_migrating(p) check is
needed to distinguish stage 2 from stage 4 because they take different
paths. __update_stats_wait_start(), however, does not need to distinguish
stage 1 from stage 3: in stage 1 prev_wait_start is 0, so the same
subtraction works for both.

As for the wait_start > prev_wait_start condition, I think it is more
of a safeguard against statistical deviations caused by time
inconsistencies.

Thanks