rcu_preempt detected stalls
Jorge Ramirez-Ortiz, Foundries
jorge at foundries.io
Tue Aug 31 08:21:44 PDT 2021
Hi
When enabling CONFIG_PREEMPT and running the stress-ng scheduler class
tests on arm64 (Xilinx ZynqMP and NXP i.MX8MM SoCs), we are observing the following.
[ 62.578917] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 62.585015] (detected by 0, t=5253 jiffies, g=3017, q=2972)
[ 62.590663] rcu: All QSes seen, last rcu_preempt kthread activity 5254 (4294907943-4294902689), jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 62.603086] rcu: rcu_preempt kthread starved for 5258 jiffies! g3017 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[ 62.613246] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 62.622359] rcu: RCU grace-period kthread stack dump:
[ 62.627395] task:rcu_preempt state:R running task stack: 0 pid: 14 ppid: 2 flags:0x00000028
[ 62.637308] Call trace:
[ 62.639748] __switch_to+0x11c/0x190
[ 62.643319] __schedule+0x3b8/0x8d8
[ 62.646796] schedule+0x4c/0x108
[ 62.650018] schedule_timeout+0x1ac/0x358
[ 62.654021] rcu_gp_kthread+0x6a8/0x12b8
[ 62.657933] kthread+0x14c/0x158
[ 62.661153] ret_from_fork+0x10/0x18
[ 62.682919] BUG: scheduling while atomic: stress-ng-hrtim/831/0x00000002
[ 62.689604] Preemption disabled at:
[ 62.689614] [<ffffffc010059418>] irq_enter_rcu+0x30/0x58
[ 62.698393] CPU: 0 PID: 831 Comm: stress-ng-hrtim Not tainted 5.10.42+ #5
[ 62.706296] Hardware name: Zynqmp new (DT)
[ 62.710115] Call trace:
[ 62.712548] dump_backtrace+0x0/0x240
[ 62.716202] show_stack+0x2c/0x38
[ 62.719510] dump_stack+0xcc/0x104
[ 62.722904] __schedule_bug+0x78/0xc8
[ 62.726556] __schedule+0x70c/0x8d8
[ 62.730037] schedule+0x4c/0x108
[ 62.733259] do_notify_resume+0x224/0x5d8
[ 62.737259] work_pending+0xc/0x2a4
The error eventually results in an OOM.
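For reference, the reproducer is essentially a sequential run of the
stress-ng scheduler class, something along these lines (quoting from
memory, the exact options we pass may differ slightly):

  # run every stressor in the scheduler class in turn, one instance per CPU
  stress-ng --class scheduler --sequential 0 --timeout 60s --verbose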
Enabling RCU priority boosting does work around the issue, but that
seems to me more of a workaround than a fix (otherwise I would expect
boosting to be enabled by CONFIG_PREEMPT on arm64?).
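(For completeness, by RCU priority boosting I mean the usual Kconfig
options, roughly the following; the exact delay value is from memory
and may not match what we tested:

  CONFIG_RCU_BOOST=y
  # delay in ms before boosting readers blocking a grace period; 500 is the default
  CONFIG_RCU_BOOST_DELAY=500
)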
The question is: is this an arm64 bug that should be investigated, or
is it a known corner case of running stress-ng that is already
understood?
thanks
Jorge