[PATCH] rcu-tasks: Fix race in schedule and flush work

Thu Feb 17 09:43:30 PST 2022

On Thu, Feb 17, 2022 at 04:25:19PM +0100, Padmanabha Srinivasaiah wrote:
> While booting secondary CPUs, cpus_read_[lock/unlock] not keeping online
> cpumask stable. The transient online mask results in below calltrace.
> 
> [    0.324121] CPU1: Booted secondary processor 0x0000000001 [0x410fd083]
> [    0.346652] Detected PIPT I-cache on CPU2
> [    0.347212] CPU2: Booted secondary processor 0x0000000002 [0x410fd083]
> [    0.377255] Detected PIPT I-cache on CPU3
> [    0.377823] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
> [    0.379040] ------------[ cut here ]------------
> [    0.383662] WARNING: CPU: 0 PID: 10 at kernel/workqueue.c:3084 __flush_work+0x12c/0x138
> [    0.384850] Modules linked in:
> [    0.385403] CPU: 0 PID: 10 Comm: rcu_tasks_rude_ Not tainted 5.17.0-rc3-v8+ #13
> [    0.386473] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
> [    0.387289] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    0.388308] pc : __flush_work+0x12c/0x138
> [    0.388970] lr : __flush_work+0x80/0x138
> [    0.389620] sp : ffffffc00aaf3c60
> [    0.390139] x29: ffffffc00aaf3d20 x28: ffffffc009c16af0 x27: ffffff80f761df48
> [    0.391316] x26: 0000000000000004 x25: 0000000000000003 x24: 0000000000000100
> [    0.392493] x23: ffffffffffffffff x22: ffffffc009c16b10 x21: ffffffc009c16b28
> [    0.393668] x20: ffffffc009e53861 x19: ffffff80f77fbf40 x18: 00000000d744fcc9
> [    0.394842] x17: 000000000000000b x16: 00000000000001c2 x15: ffffffc009e57550
> [    0.396016] x14: 0000000000000000 x13: ffffffffffffffff x12: 0000000100000000
> [    0.397190] x11: 0000000000000462 x10: ffffff8040258008 x9 : 0000000100000000
> [    0.398364] x8 : 0000000000000000 x7 : ffffffc0093c8bf4 x6 : 0000000000000000
> [    0.399538] x5 : 0000000000000000 x4 : ffffffc00a976e40 x3 : ffffffc00810444c
> [    0.400711] x2 : 0000000000000004 x1 : 0000000000000000 x0 : 0000000000000000
> [    0.401886] Call trace:
> [    0.402309]  __flush_work+0x12c/0x138
> [    0.402941]  schedule_on_each_cpu+0x228/0x278
> [    0.403693]  rcu_tasks_rude_wait_gp+0x130/0x144
> [    0.404502]  rcu_tasks_kthread+0x220/0x254
> [    0.405264]  kthread+0x174/0x1ac
> [    0.405837]  ret_from_fork+0x10/0x20
> [    0.406456] irq event stamp: 102
> [    0.406966] hardirqs last  enabled at (101): [<ffffffc0093c8468>] _raw_spin_unlock_irq+0x78/0xb4
> [    0.408304] hardirqs last disabled at (102): [<ffffffc0093b8270>] el1_dbg+0x24/0x5c
> [    0.409410] softirqs last  enabled at (54): [<ffffffc0081b80c8>] local_bh_enable+0xc/0x2c
> [    0.410645] softirqs last disabled at (50): [<ffffffc0081b809c>] local_bh_disable+0xc/0x2c
> [    0.411890] ---[ end trace 0000000000000000 ]---
> [    0.413000] smp: Brought up 1 node, 4 CPUs
> [    0.413762] SMP: Total of 4 processors activated.
> [    0.414566] CPU features: detected: 32-bit EL0 Support
> [    0.415414] CPU features: detected: 32-bit EL1 Support
> [    0.416278] CPU features: detected: CRC32 instructions
> [    0.447021] Callback from call_rcu_tasks_rude() invoked.
> [    0.506693] Callback from call_rcu_tasks() invoked.
> 
> Link: https://lore.kernel.org/lkml/20220210184319.25009-1-treasure4paddy@gmail.com/T/
> Suggested-by: Paul E. McKenney <paulmck at kernel.org>
> Signed-off-by: Padmanabha Srinivasaiah <treasure4paddy at gmail.com>

Thank you, Padmanabha!  I have queued this with additional explanation
in the commit log as shown below.

If you have not already tested this by repeatedly rebooting, for
example continuously in an overnight test, could you please do so?
(I am assuming that your system can do an automated reboot loop, so no,
I am not asking you to stay up all night!)

Should this somehow recur with larger systems, in which the call starts
with (say) five CPUs online but has trouble with the sixth CPU, there
are several ways to address it:

1.	Make cpus_read_lock() act as expected.

2.	Make the first for_each_online_cpu() in schedule_on_each_cpu()
	instead be for_each_possible_cpu(), always do the INIT_WORK(),
	but invoke schedule_work_on() only if cpu_online(cpu).

3.	Other ideas here, when and if.

							Thanx, Paul

------------------------------------------------------------------------

commit f6823834aa2e2e581ca627238fa3ad3e0a727c08
Author: Padmanabha Srinivasaiah <treasure4paddy at gmail.com>
Date:   Thu Feb 17 16:25:19 2022 +0100

    rcu-tasks: Fix race in schedule and flush work
    
    While booting secondary CPUs, cpus_read_[lock/unlock] is not keeping
    online cpumask stable. The transient online mask results in below
    calltrace.
    
    [    0.324121] CPU1: Booted secondary processor 0x0000000001 [0x410fd083]
    [    0.346652] Detected PIPT I-cache on CPU2
    [    0.347212] CPU2: Booted secondary processor 0x0000000002 [0x410fd083]
    [    0.377255] Detected PIPT I-cache on CPU3
    [    0.377823] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
    [    0.379040] ------------[ cut here ]------------
    [    0.383662] WARNING: CPU: 0 PID: 10 at kernel/workqueue.c:3084 __flush_work+0x12c/0x138
    [    0.384850] Modules linked in:
    [    0.385403] CPU: 0 PID: 10 Comm: rcu_tasks_rude_ Not tainted 5.17.0-rc3-v8+ #13
    [    0.386473] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
    [    0.387289] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [    0.388308] pc : __flush_work+0x12c/0x138
    [    0.388970] lr : __flush_work+0x80/0x138
    [    0.389620] sp : ffffffc00aaf3c60
    [    0.390139] x29: ffffffc00aaf3d20 x28: ffffffc009c16af0 x27: ffffff80f761df48
    [    0.391316] x26: 0000000000000004 x25: 0000000000000003 x24: 0000000000000100
    [    0.392493] x23: ffffffffffffffff x22: ffffffc009c16b10 x21: ffffffc009c16b28
    [    0.393668] x20: ffffffc009e53861 x19: ffffff80f77fbf40 x18: 00000000d744fcc9
    [    0.394842] x17: 000000000000000b x16: 00000000000001c2 x15: ffffffc009e57550
    [    0.396016] x14: 0000000000000000 x13: ffffffffffffffff x12: 0000000100000000
    [    0.397190] x11: 0000000000000462 x10: ffffff8040258008 x9 : 0000000100000000
    [    0.398364] x8 : 0000000000000000 x7 : ffffffc0093c8bf4 x6 : 0000000000000000
    [    0.399538] x5 : 0000000000000000 x4 : ffffffc00a976e40 x3 : ffffffc00810444c
    [    0.400711] x2 : 0000000000000004 x1 : 0000000000000000 x0 : 0000000000000000
    [    0.401886] Call trace:
    [    0.402309]  __flush_work+0x12c/0x138
    [    0.402941]  schedule_on_each_cpu+0x228/0x278
    [    0.403693]  rcu_tasks_rude_wait_gp+0x130/0x144
    [    0.404502]  rcu_tasks_kthread+0x220/0x254
    [    0.405264]  kthread+0x174/0x1ac
    [    0.405837]  ret_from_fork+0x10/0x20
    [    0.406456] irq event stamp: 102
    [    0.406966] hardirqs last  enabled at (101): [<ffffffc0093c8468>] _raw_spin_unlock_irq+0x78/0xb4
    [    0.408304] hardirqs last disabled at (102): [<ffffffc0093b8270>] el1_dbg+0x24/0x5c
    [    0.409410] softirqs last  enabled at (54): [<ffffffc0081b80c8>] local_bh_enable+0xc/0x2c
    [    0.410645] softirqs last disabled at (50): [<ffffffc0081b809c>] local_bh_disable+0xc/0x2c
    [    0.411890] ---[ end trace 0000000000000000 ]---
    [    0.413000] smp: Brought up 1 node, 4 CPUs
    [    0.413762] SMP: Total of 4 processors activated.
    [    0.414566] CPU features: detected: 32-bit EL0 Support
    [    0.415414] CPU features: detected: 32-bit EL1 Support
    [    0.416278] CPU features: detected: CRC32 instructions
    [    0.447021] Callback from call_rcu_tasks_rude() invoked.
    [    0.506693] Callback from call_rcu_tasks() invoked.
    
    This commit therefore fixes this issue by applying a single-CPU
    optimization to the RCU Tasks Rude grace-period process.  The key point
    here is that the purpose of this RCU flavor is to force a schedule on
    each online CPU since some past event.  But the rcu_tasks_rude_wait_gp()
    function runs in the context of the RCU Tasks Rude's grace-period kthread,
    so there must already have been a context switch on the current CPU since
    the call to either synchronize_rcu_tasks_rude() or call_rcu_tasks_rude().
    So if there is only a single CPU online, RCU Tasks Rude's grace-period
    kthread does not need to anything at all.
    
    It turns out that the rcu_tasks_rude_wait_gp() function's call to
    schedule_on_each_cpu() causes problems during early boot.  During that
    time, there is only one online CPU, namely the boot CPU.  Therefore,
    applying this single-CPU optimization fixes early-boot instances of
    this problem.
    
    Link: https://lore.kernel.org/lkml/20220210184319.25009-1-treasure4paddy@gmail.com/T/
    Suggested-by: Paul E. McKenney <paulmck at kernel.org>
    Signed-off-by: Padmanabha Srinivasaiah <treasure4paddy at gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck at kernel.org>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d73e32d803438..f7fdb349f8b8e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -955,6 +955,9 @@ static void rcu_tasks_be_rude(struct work_struct *work)
 // Wait for one rude RCU-tasks grace period.
 static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
 {
+	if (num_online_cpus() <= 1)
+		return;	// Fastpath for only one CPU.
+
 	rtp->n_ipis += cpumask_weight(cpu_online_mask);
 	schedule_on_each_cpu(rcu_tasks_be_rude);
 }