Loadavg accounting error on arm64

Peter Zijlstra peterz at infradead.org
Mon Nov 16 09:20:05 EST 2020


On Mon, Nov 16, 2020 at 01:37:21PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote:

> > Anyway, setting all that aside, I do agree with you that the bitfield usage
> > in task_struct looks pretty suspicious. For example, in __schedule() we
> > have:
> > 
> > 	rq_lock(rq, &rf);
> > 	smp_mb__after_spinlock();
> > 	...
> > 	prev_state = prev->state;
> > 
> > 	if (!preempt && prev_state) {
> > 		if (signal_pending_state(prev_state, prev)) {
> > 			prev->state = TASK_RUNNING;
> > 		} else {
> > 			prev->sched_contributes_to_load =
> > 				(prev_state & TASK_UNINTERRUPTIBLE) &&
> > 				!(prev_state & TASK_NOLOAD) &&
> > 				!(prev->flags & PF_FROZEN);
> > 			...
> > 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> > 
> > where deactivate_task() updates p->on_rq directly:
> > 
> > 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> > 
> 
> It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> document ttwu()") which changed it. Not sure why that is and didn't
> think about it too deep as it didn't appear to be directly related to
> the problem and didn't have ordering consequences.

I'm confused; that commit didn't change deactivate_task(). Anyway,
->on_rq should be strictly under rq->lock. That said, since there is a
READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
WRITE_ONCE().

> > __ttwu_queue_wakelist() we have:
> > 
> > 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> > 
> > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > updates can race and cause the flags to be corrupted?
> > 
> 
> I think this is at least one possibility. I think at least that one
> should only be explicitly set on WF_MIGRATED and explicitly cleared in
> sched_ttwu_pending. While I haven't audited it fully, it might be enough
> to avoid a double write outside of the rq lock on the bitfield but I
> still need to think more about the ordering of sched_contributes_to_load
> and whether it's ordered by p->on_cpu or not.

The scenario you're worried about is something like:

	CPU0							CPU1

	schedule()
		prev->sched_contributes_to_load = X;
		deactivate_task(prev);

								try_to_wake_up()
									if (p->on_rq &&) // false
									if (smp_load_acquire(&p->on_cpu) && // true
									    ttwu_queue_wakelist())
										p->sched_remote_wakeup = Y;

		smp_store_release(prev->on_cpu, 0);



And then the stores of X and Y clobber one another.. Hummph, seems
reasonable. One quick thing to test would be something like this:


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..9844e541c94c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,9 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			:0;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			:0;
 #ifdef CONFIG_PSI
 	unsigned			sched_psi_wake_requeue:1;
 #endif



More information about the linux-arm-kernel mailing list